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Abstract 

Information theory is a powerful tool to express principles to drive autonomous systems because it is 
domain invariant and allows for an intuitive interpretation. This paper studies the use of the predictive 
information (PI), also called excess entropy or effective measure complexity, of the sensorimotor process 
as a driving force to generate behavior. We study nonlinear and nonstationary systems and introduce 
the time-local predicting information (TiPI) which allows us to derive exact results together with explicit 
update rules for the parameters of the controller in the dynamical systems framework. In this way the 
information principle, formulated at the level of behavior, is translated to the dynamics of the synapses. 
We underpin our results with a number of case studies with high-dimensional robotic systems. We show 
the spontaneous cooperativity in a complex physical system with decentralized control. Moreover, a 
jointly controlled humanoid robot develops a high behavioral variety depending on its physics and the 
environment it is dynamically embedded into. The behavior can be decomposed into a succession of 
low-dimensional modes that increasingly explore the behavior space. This is a promising way to avoid 
the curse of dimensionality which hinders learning systems to scale well. 

1 Introduction 

Autonomy is a puzzling phenomenon in nature and a major challenge in the world of artifacts. A key feature 
of autonomy in both natural and artificial systems is seen in the ability for independent exploration [9 |. In 
animals and humans, the ability to modify its own pattern of activity is not only an indispensable trait 
for adaptation and survival in new situations, it also provides a learning system with novel information 
for improving its cognitive capabilities, and it is essential for development. Efficient exploration in high- 
dimensional spaces is a major challenge in building learning systems. The famous exploration-exploitation 
trade-off was extensively studied in the area of reinforcement learning [60|. In a Bayesian formulation 
this trade-off can be optimally solved [22], however it is computationally intractable. A more conceptual 
solution is to provide the agent with an intrinsic motivation ll5Tll55ll for focusing on certain things and thus 
constraining the exploration to a smaller space. To approach this problem in a more fundamental way we 
consider mechanisms for goal-free exploration of the dynamical properties of a physical system, e. g. a 
robot. If the exploration is rooted in the agent in a self-determined way, i. e. as a deterministic function 
of internal state variables and not via a pseudo-random generator it has the chance to escape the curse of 
dimensionality. Why? Because specific features of the system such as constrains and other embodiment 
effects can be exploited to reduce the search space. Thus an exploration strategy taking the particular 
body and environment into account is vital for building efficient learning algorithms for high-dimensional 
robotic systems. But how can goal-free exploration be useful to actually pursue goals? We show that 
a variety of coordinated sensorimotor patterns are formed that may be used to quickly construct more 
complex behaviors using a second level of learning. It may also be used more directly in combination with 
reinforcement learning where the typical random exploration is substituted or augmented by the goal-free 
exploration leading presumably to a large speedup. 
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The solution for such a general problem needs a core paradigm in order to be relevant for a large class of 
systems. In recent years, information theory has come into the focus of researchers interested in a number 
of related issues ranging from quantifying and better understanding autonomous systems [7 , 33 23 56ll24l 
l64l l53l to questions of spontaneity in biology and technical systems ifTUl to the self-organization of robot 
behavior Il2ll65l. 

A systematic approach requires both a convenient definition of the information measure and a robust, 
real time algorithm for the maximization of that measure. This paper studies in detail the use of the 
predictive information (PI) of a robot's sensorimotor process. The predictive information of a process 
quantifies the total information of past experience that can be used for predicting future events. Technically, 
it is defined as the mutual information between the past and the future of the time series. It has been 
argued |8| that predictive information, also termed excess entropy 1131 and effective measure complexity 
|26|, is the most natural complexity measure for time series. By definition, predictive information of the 
sensor process is high if the robot manages to produce a stream of sensor values with high information 
content (in the Shannon sense) by using actions that lead to predictable consequences. A robot maximizing 
PI therefore is expected to show a high variety of behavior without becoming chaotic or purely random. In 
this working regime, somewhere between order and chaos, the robot will explore its behavioral spectrum 
in a self-determined way in the sense discussed above. 

This paper studies the control of robots by simple neural networks whose parameters (synaptic strengths 
and threshold values) are adapted on-line to maximize (a modified) PI of the sensor process. These rules 
define a mechanism for behavioral variability as a deterministic function formulated at the synaptic level. 
For linear systems a number of features of the PI maximization method have been demonstrated 0. In 
particular, it could be shown that the principle makes the system to explore its behavior space in a systematic 
manner. In a specific case, the PI maximization caused the controller of a stochastic oscillator system 
to sweep through the space of available frequencies. More importantly, if the world is hosting a latent 
oscillation, the controller will learn by PI maximization to go into resonance with this inherent mode of the 
world. This is encouraging, since maximizing the PI means (at least in this simple example) to recognize 
and amplify the latent modes of the robotic system. 

The present paper is devoted to the extension of the above mentioned method to nonlinear systems with 
nonstationary dynamics. This leads to a number of novel elements in the present approach. Commonly 
information theoretic measures are optimized in the stationary state. This is not adequate for a robot in 
a self-determined process of behavioral development. This paper develops a more appropriate measure 
for this purpose called the time-local predictive information (TiPI) for general nonstationary processes by 
using a specific windowing technique and conditioning. Moreover, the application of information theoretic 
measures in robotics is often restricted to the case of a finite state-action space with discrete actions and 
sensor values. Also these restrictions are overcome in this paper so that it can be used immediately in 
physical robots with high dimensional state-action space. This will be demonstrated by examples with 
two robots in a physically realistic simulation. The approach is seen to work from scratch, i. e. without 
any knowledge about the robot, so that everything has to be inferred from the sensor values alone. In 
contrast to the linear case the nonlinearities and the nonstationarity introduce a number of new phenomena, 
for instance the self-switching dynamics in a simple hysteresis system and the spontaneous cooperation 
of physically coupled systems. In high-dimensional systems we observe behavioral patterns of reduced 
dimensionality that are dependent on the body and the environment of the robot. 

1.1 Relation to other work 

Finding general mechanisms that help robots and other systems to more autonomy, is the topic of intensive 
recent research. The approaches are widely scattered and follow many different routes so that we give in 
the following just a few examples. 

1.1.1 Information theoretic measures 

Information theory has been used recently in a number of approaches in robotics in order (i) to understand 
how input information is structured by the behavior [34, 33 1 and (ii) to quantify the nature of information 
flows inside the brain ll23l l56l l24l and in behaving robots Il53ll64l . An interesting information measure is 
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the empowerment, quantifying the amount of Shannon information that an agent can "inject into" its sensor 
through the environment, affecting future actions and future perceptions. Recently, empowerment has been 
demonstrated to be a viable objective for the self-determined development of behavior in the pole balancer 
problem and other agents in continuous domains [28|. 

Driving exploration by maximizing PI can also be considered as an alternative to the principle of 
homeokinesis as introduced in Der and Liebscher fl5l . Der lfl4ll that has been applied successfully to a 
large number of complex robotic systems, see Der et al. 1(191 118, 20|, Der and Martius 1 16| and the recent 
book Der and Martius ifTTl . Moreover, this principle has also been extended to form a basis for a guided 
self-organization of behavior lT39l[T7ll . 

1.1.2 Intrinsic motivation 

As mentioned above, the self-determined and self-directed exploration for embodied autonomous agents is 
closely related to many recent efforts to equip the robot with a motivation system producing internal reward 
signals for reinforcement learning in pre-specified tasks. Pioneering work has been done by Schmidhuber 
using the prediction progress as a reward signal in order to make the robot curious for new experiences iBUl 
[58] [52]. Related ideas have been put forward in the so called play ground experiment [29 43|. There have 
been also a few proposals to autonomously form a hierarchy of competencies using the prediction error 
of skill models J3] or more abstractly to balance skills and challenges l57l . Predictive information can 
also be used as an intrinsic motivation in reinforcement learning ll66l or additional fitness in evolutionary 
robotics J46). 

1.1.3 Embodiment 

The past two decades in robotics have seen the emergence of a new trend of control in robotics which 
is rooted more deeply in the dynamical systems approach to robotics using continuous sensor and action 
variables. This approach yields more natural movements of the robots and allows to exploit embodiment 
effects in an effective way, see Pfeifer and Bongard [44], Pfeifer et al. [45] for an excellent survey. The 
approach described in the present paper is tightly coupled to the ideas of exploiting the embodiment, since 
the development of behavioral modes is entire dependent on the dynamical coupling of the body, brain, and 
its environment. 

1.1.4 Spontaneity 

We would like to briefly discuss the implications of using a self-determinecQand deterministic mechanism 
of exploration to the understanding of variability in animal behavior. In the animal kingdom, there is 
increasing evidence showing that animals from invertebrates to fish, birds, and mammals are equipped 
with a surprising degree of variety in response to external stimulation J4j[25]|59l[6l. So far, it is not clear 
how this behavioral variability is created. Ideas cover the whole range from the quantum effects |30| 
(pure and inexorable randomness) to thermal fluctuations at the molecular level to the assumption of pure 
spontaneity [42], rooting the variability in the existence of intrinsic, purely deterministic processes. 

This paper shows that a pure spontaneity is enough to produce behavioral variations, and as in animals, 
their exact source appears "indecipherable" from an observer point of view. If the variation of behavior in 
animals is produced in a similar way, this would bring new insights into the free will conundrum [10|. 

2 Methods 

We start with the general expressions for the predictive information (PI) and introduce a derived quantity 
called time-local predictive information (TiPI) more suitable for the intended treatment of nonstationary 
systems. Based on the specific choice of the time windows we derive estimates of the TiPI for general 
stochastic dynamical systems and give explicit expressions for the special case of a Gaussian noise. The 
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explicit expressions are used for the derivation of the parameter dynamics of the controller (exploration dy- 
namics) obtained by gradient ascending the TiPI. Besides giving the exploration dynamics as a batch rule 
we also derive, in the sense of a stochastic gradient rule, the one-shot gradient. The resulting combined dy- 
namics (system plus exploration dynamics) is a deterministic dynamical system, where the self-exploration 
of the system becomes a part of the strategy. These general results are then applied to the case of the sen- 
sorimotor loop and we discuss their Hebbian nature. 

2.1 Predictive information 

The PI of a time discrete process {S,}^ =a with values in M." is defined [8] as the mutual information between 
the past and the future, relative to some instant of time a < t o < b 

I (-^future \ Spast) = (in S r — - ^— \ = H (Sfuture ) — H ( Sfuture | Spast) ( 1 ) 

\ P(,«pastJP( s fiiture) / 

where the averaging is over the joint probability density distribution p (s pas t , ^future ) with past := {a, . . . ,to} 
and future := {to + 1, . . . , b}. In more detail, we use the (differential) entropy H (S) of a random variable S 
given by 



H(S) = -J p(s)]np(s) ds 



where p(s) is the probability density distribution of the random variable S. The conditional entropy 
H (S t \S t -i) is defined accordingly 



H(S t \S,-i) = —J J p(s t \st-i)iap(st\s t -i) ds, p(s t ^i)ds t ^i 



p(s,\s t -i) being the conditional probability density distribution of s, given St-i. As is well known, in 
the case of continuous variables, the individual entropy components H (Sfuture ), H (Sfuture |«Spast) mav we U 
be negative whereas the PI is always positive and will exist even in cases where the individual entropies 
diverge. This is a very favorable property deriving from the explicit scale invariance of the PI |2|. 

The usefulness of the PI for the development of explorative behaviors of autonomous robots has been 
discussed earlieij^] see Ay et al. HI, Der et al. ED . Ay et al. Q. This paper continues these investigations 
for the case of more general situations than those discussed before. In order to do so, we have to introduce 
some specifications necessary for the development of a versatile and stable algorithm realizing the increase 
of PI in the sensor process at least approximately. 

Let us start with simplifying eq. ([TJ. If {S t }^ =a is a Markov process, see Ay et al. JT], the PI is given by 
the mutual information (MI) between two successive time steps, i. e. instead of eq. ([T| we have 

/(M-i) = /in P ^ S '' S l~ 1 \ ) =H(S t )-H(S t \St- 1 ) (2) 



'p(s t )p(s t -i 

the averaging being done over the joint probability density p (s t ,s t -i). Actually, any realistic sensor process 
will only be in exceptional cases purely Markovian. However, we can use the simplified expression |2]) — 
let us call it the one-step PI — also for general sensor processes taking it as the definition of the objective 
function driving the autonomous exploration dynamics to be derived. 



2 In experiments with a coupled chain of robots 1211 it was observed that the PI of just a single sensor, one of the wheel counters 
of an individual robot, already yields essential information on the behavior of the robot chain. The PI turned out to be maximal 
if the individual robots managed to cooperate so that the chain as a whole could navigate effectively. This is remarkable in that a 
one-dimensional sensor process can already give essential information on the behavior of a very complex physical object under real 
world conditions. These results give us some encouragement to study the role of PI and other information measures for specific sensor 
processes as is done in the present paper. 
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2.1.1 Nonstationarity and time-local predictive information (TiPI) 

Most applications done so far were striving for the evaluation of the PI in a stationary state of the system. 
With our robotic applications, this is neither necessary nor adequate. The robot is to develop a variety of 
behavioral modes ideally in a open-ended fashion, which will certainly not lead to a stationary distribution 
of sensor values. The PI would change on the timescale of the behavior. How can one obtain in this case 
the probability distributions of p(s t )l The solution we suggest is to introduce a conditioning on an initial 
state in a moving time window and thus obtain the distributions from our local model as introduced below. 
More formally, let us consider the following setting. Let t be the current instant of time and z be the length 
of a time window T steps into the past. We study the process in that window with a fixed starting state s t - T 
so that all distributions in eq. |2) are conditioned on state s,_ T . For instance, instead of p (s t ) in eq. pi, we 
have to us£| 

P (st\s t -x) = j ' ■■■ j P Or, ■ • • ,^-t+i \s t -z) dSf-i ' ' ' ds,- T+ i (3) 

and the related expression for p (s t ,s t -\ \s t -t), where p(s t ,... ,s t -%+\ \st-t~) is the joint probability distribu- 
tion for the process in the time window, conditioned on s,_ T . As to notation, the conditional probabilities 
depend explicitly on time so that p (s t i \s,i_ i ) is different from p (s t n \s t n_ { ) in general if t' ^ t " , with equal- 
ity only in the stationary state. As a result we obtain the new quantity, written in a short-hand notation as 



r(S t ;S t ^):=I{S t -S t ^\S t ^=s t ^) (4) 

which we call time-local predictive information (TiPI). Note the difference to the conditional mutual infor- 
mation where an averaging over s,_ T would take place. Analogously we define the time local entropy as 



H T (S t ) := H (S,\S t - T = s t - t ) (5) 

2.2 Estimating the TiPI 

To evaluate the TiPI only the kernels have to be known which can be sampled by the agent on the basis of 
the measured sensor values. However, in order to get explicit update rules driving the increase of the TiPI, 
these kernels have to be known as a function of the parameters of the system, in particular those of the 
controller. This can be done by learning the kernels as a function of the parameters. A related approach, 
followed in this paper, is to learn a model of the time series, i.e. learning a function yf : M." — > M" acting as 
a time series predictor S t = \]/(S t -\) +Z with realization 

s t = Y(s t -i) + ^ t (6) 

for any time t , % t being the prediction error, also called the noise in the following. \j/ can be realized for 
instance by a neural network that can be trained with any of the standard supervised learning techniques. 
A concrete example will be considered below, see eq. ( |25] l. The relation to the kernel notation is obtained 
by observing that 

p(*»k-i) = f8(st-Y(s,-i)-Z,)pE&)dZ,=pE(s t -Y(s,-i)) (7) 

where 8 (x) is the Dirac delta distribution and p? (§) is the probability density of the random variable E 
(prediction error) which may depend on the state s itself (multiplicative noise). 

The case of linear systems, where \j/ (s) = Ls + b with a constant matrix L, has been treated in Ay et al. 
[2 1 revealing many interesting properties of the PI. How can we translate the findings of the linear systems 



3 In the Markovian case this boils down to 

p(s,\s t -r) = j ■■■ J p(st\st-l)-"P(s t -i+i\st-x)dst-i ■■■ dSf-x+l 
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Figure 1 : The time window and the error propagation dynamics used for calculating the TiPI, eq. ( |4T| >. 

In principle, the process is considered many times with always the same starting value but different real- 
izations of the noise £, . Note that, when using the one-shot gradients, only one realization is needed, see 
section |231 



to the case of nonlinear systems? As it turns out, the nonlinearities introduce many difficulties into the 
evaluation of the PI as it becomes clear already in a one-dimensional bistable system as treated in Ay et al. 
0]. Higher dimensional systems bring even more of such difficulties so that we propose to consider the 
information quantity on a new basis. The idea is to study the TiPI of the error propagation dynamics in the 
stochastic dynamical system instead of the process S t itself. 



2.2.1 Error propagation dynamics 

Let us introduce a new variable describing the deviation of the actual dynamics, eq. i5l, from the determin- 
istic prediction in a certain time window. We define for a time window starting at time t — X 

8s t , = s t , - /-('-*) ( if _ T ) (8) 

for any time ?' with t — X < t' < t and (s) = s. As to notation, 8s denotes a single variable not to be 
confused with the Dirac function. Intuitively 8s,i captures how the prediction errors occurred since the start 
of the time window are propagated up to time t'. Figure[T]illustrates the transformed state and the relevant 
distributions of the belonging process 8S t . Interestingly the TiPI on the process 8S is equivalent to the one 
on the original process S, see eq. ( pT) . The dynamics of the 8s can be approximated by linearization as 

S.y =£(V-i)Sv-i+4'+0(ll^|| 2 ) (9) 
using the Jacobian 

L iJ (s)= d ^l (10) 

Assuming the prediction errors (noise) ^ to be both small and Gaussian we obtain an explicit expression 
for the TiPI on 8S 

I T (8S t :8S t - l ) = ^ln\L t \-^]n\D t \ (11) 

where £ = (5s5s T ) is the covariance matrix of SS and D — (4<jj T ) is the covariance matrix of the noise. 
The derivation and further details are in the Appendix section[A] The results for linear systems in Ay et al. 
[2 1 can be obtained from the general case considered here by x — s- °°. 

When looking at eq. ( fPT) one sees that the entropies are expressed in terms of covariance matrices. 
This is exact in the case of Gaussian distributions. In the general case this may be considered as an 
approximation to the true TiPI. Alternatively, we can also consider eq. (JTTJ as the definition of a new 
objective function for any process if we agree to measure variability not in terms of entropies but more 
directly in terms of the covariance matrices. 
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2.3 The exploration dynamics 

Our aim is the derivation of an algorithm driving the behavior of the agent toward increasing TiPI. Let us 
assume that the function l/f : M" —> W depends on a set of parameters 9 so that we may write the dynamics 
as 

fft = V(*t-i,$) + 6+i (12) 

For instance, if y/ is a neural network as introduced further below, the parameter set 9 comprises just the 
synaptic weights and threshold values of the neurons. 

2.3.1 Gradient ascending the TiPI 

Based on the TiPI, eq. ( fTT| , a rule for the parameter dynamics is given by the gradient step to be executed 
at each time t 

A0 ' = 4? = 4 ln|E '' (13) 

where e is the update rate and 9 t +\ = 9 t + A9,. The term In \D\ from eq. ( fTT| has been omitted assuming 
that 4 is essentially noise which is not depending on the parameters of the controller. This is justifiable 
in the case of parsimonious control as realized by the low-complexity controller networks. These generate 
typically well predictable (low noise) behaviors as shown in the applications studied below. 

In order to get more explicit expressions, let us consider the case of very short time windows. With 
T = 1 there is no learning signal since E = D meaning that F = 0. So, T = 2 is the most simple nontrivial 
case. The parameter dynamics is given by 



A9, = sou — ^ -Ss t -i 

o9 

where 8 s and the auxiliary vector 5 m are given as 

Sst-i =s,-i -\jf(st- 2 ) 
8s t =s t -Y(\if(s t - 2 )) 
8u t = Y. t ~ l Ss t 

E, = (^8s t 8sJ^ 



(14) 



(15) 
(16) 
(17) 

(18) 



stipulating the noise is different from zero (though possibly infinitesimal) and employing the self-averaging 
property of a stochastic gradient, see below. The general parameter dynamics for arbitrary z is derived in 
Appendix section|B] However, in the applications described below, already the simple parameter dynamics 
with T = 2 will be seen to create most complex behaviors of the considered physical robots. 

In a nutshell, eq. ( fT3] l reveals already the main effect of TiPI maximization: increasing |E| means 
increasing the norm of 8s (in the E-metric see eq. (f56])). This is achieved by increasing the amplification 
of small fluctuations in the sensorimotor dynamics which is equivalent to increasing the instability of the 
system dynamics, see also the more elaborate discussion in Der et al. [21 1. 

2.3.2 Learning vs. exploration dynamics 

Usually, updating the parameters of a system according to a given objective is called learning. In that 
sense, the gradient ascent on the TiPI defines a learning dynamics. However, we would like to avoid this 
notion here, since actually nothing is learnt. Instead by the interplay between the system and the parameter 
dynamics, the combined system never reaches a final behavior corresponding to the goal of a learning 
process. Therefore we prefer the notion exploration dynamics for the dynamics in the parameter space that 
is driven by the TiPI maximization. 
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2.3.3 One-shot gradients 

The formulas for the gradient (eqs. ( fT~4| > and ((56])) were obtained by tacitly invoking the self-averaging 
properties of the gradient, i.e. by simply replacing (SsSs T } with 8s8s T in eq. ( f5T~| >. This still needs a 
little discussion. Actually, the self-averaging is exactly valid only in the limit of sufficiently small e, with 
e eventually being driven to zero in a convenient way. However, our scenario is different. What we are 
aiming at is the derivation of an intrinsic mechanism for the self-determined and self-directed exploration 
using the TiPI and related objectives. The essential point is that self-exploration is driven by a deterministic 
function of the states (sensor values) of the system itself. 

Equation ( fl4] > obtained from the gradient of the TiPI fulfills these aims very well — any change of the 
system parameters and hence of the behavior is given in terms of the predecessor states in the short time 
window. With finite (and often quite large) e eqs. (fl4|i-(fT8"|l are just a rough approximation of the original 
TiPI but, in view of our goal, the one-shot nature of the gradient is favorable as it supports the explorative 
nature of the exploration dynamics generating interesting synergy effects. 

2.3.4 Synergy of system and exploration dynamics 

A further central aspect of our approach is the interplay between the system and the parameter dynamics 
driven by the TiPI maximization process. In specific cases, the latter may show convergence as in con- 
ventional approaches based on stationary states. An example is given by the one-parameter system studied 
in Ay et al. HI realizing convergence to the so called effective bifurcation point. However, with a richer 
parametrization and/or more complex systems, instead of convergence, the combined system (state + pa- 
rameter dynamics) never comes to a steady state due to the intensive interplay between the two dynamical 
components if e is kept finite. An example will be given in the Results section. 

Typically, the TiPI landscape permanently changes its shape due to the fact that increasing the TiPI 
means in general a destabilization of the system dynamics. If the latter is in an attractor, increasing the 
TiPI destabilizes the attractor until it may disappear altogether with a complete restructuring of the TiPI 
landscape. This is but one of the possible scenarios where the exploration dynamics engages into an 
intensive and persistent interplay with the system dynamics. This interplay leads to many synergistic 
effects between system and exploration dynamics and makes the actual flavor of the method. 

2.3.5 Self-directed search 

The common approach to solve the exploration-exploitation dilemma in learning problems is to use some 
randomization of actions in order to get the necessary exploration and then decrease the randomness to 
exploit the skills acquired so far. This is prone to the curse of dimensionality if the systems are gaining some 
complexity. Randomness can also be introduced by using a deterministic policy with a random component 
in the parameters, as quite successfully applied to evolution strategies and reinforcement learning Il54ll27l . 

Our approach is also to use deterministic policies (given by the function K) but aims at making ex- 
ploration part of the policy. So, instead of relegating exploration to the obscure activities of a random 
number generator, variation of actions should be generated by the responses of the system itself. This 
replaces randomness with spontaneity and is hoped (and will be demonstrated) to restrict the search space 
automatically to the physically relevant dimensions defined by the embodiment of the system. 

Formally, we call a search self-directed if there exists a function a so that the change in the parameters 



A6 t = a{s t ,...,s t -*,6t) (19) 

is given as a deterministic function of the states in a certain time window (of length t) and the parameter set 
9 itself. In this paper, a is given by the gradient of the predictive information in the one-shot formulation, 
see section |2~. 3. 3 1 

In more general terms, we believe that randomization of actions makes the agent heteronomous, its 
fate being determined by an obscure (to him) procedure (the pseudo-random number generator) alien to 
the nature of its dynamics. The agent is autonomous in the 'genuine' sense only if it varies its actions 
exclusively by its own internal laws (49]. In our approach, according to eq. ( fT9| ), exploration is driven 
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entirely by the dynamics of the system itself so that exploration is coupled in an intimate way to the pattern 
of behavior the robot is currently in. The danger might be that in this way the exploration is restricted too 
much. As our experiments show, this is not so for active motion patterns in high dimensional systems. This 
fact can be attributed to the destabilization effect incurred by the TiPI maximization, see above and [21 1. 
For stabilizing behaviors, however, the exploration may be too restrictive. 

2.4 The sensorimotor loop 

Let us now specify the above expressions to the case of a sensorimotor loop, in particular a neurally 
controlled robotic system. The dynamical systems formulation is obtained now by writing our predictor 
for the next sensor values as a function of both the sensors and the actions so that 

*f = 0(j f -i,flf-i) + 6 (20) 

where represents the so-called forward model and %t is the prediction error as before. As the next step, 
we consider the controller also as a deterministic function K : W —> M.'" generating actions (motor values) 
a t € W" as a function of the sensor values s t g W so that 

a t =K(s t ) (21) 



In the applications, K will be realized as a (feed-forward) neural network. Using eq. plj i in eq. ( |20] > we 
obtain the map y/ modeling our sensor process as 

W(s t -i) = <t>{s,-i, K(s,-i)) (22) 

In Ay et al. a standard linear control system was studied where K(s) = Cs, (s,a) = Ts + Va and 
y/(s) = (T +VC)s. This paper will consider a nonlinear generalization of that case in specific robotic 
applications. 

2.4.1 Exploration dynamics for neural control systems 

In the present setting, we assume that both the controller K and the forward model (j) of our robot are 
realized by neural networks, the controller being given by a single-layer neural network as 

K(s)=g(Cs + h) (23) 

the set of parameters 6 now given by C and h. In the concrete applications to be given below, we specifically 
use gi (z) = tanh (z,) (to be understood as a vector function so that g : M n —> W). 
Moreover, the forward model (j) is given by a layer of linear neurons, so that 

(s,a) =Va + Ts + b (24) 

The matrices V, T and the vector b represent the parametrization of the forward model that is adapted 
on-line by a supervised gradient procedure to minimize the prediction error as 

AV = Tj^ fl T , AT = ri^s T , Ab = rj^ (25) 

In the applications, the learning rate 7]$ is large such that the low complexity of the model is compensated 
by a very fast adaptation process. 

In contrast to the forward model parameters, the controller parameters are to be adapted to maximize the 
TiPI. For that the map y/ (eq. ( f2"2| )) is required which becomes \j/ (s) = Vg (Cs + h) + Ts + b with Jacobian 
matrix 

L = VG'{z)C + T (26) 
where z = Cs + h is the postsynaptic potential and 

G'(z)=6iag{g'(zi),...,g'(z m )} (27) 
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is the diagonal matrix of the derivatives of the activation functions for each control neuron. 

In the applications given below, we are using the short-time window, with the general exploration 
dynamics given by eq. ( |14) . The explicit exploration dynamics for this neural setting with g(z) = tanh(z) 
are given as 

-AQj = dlliSsj - jiOiSj (28) 

-Alt, = -jiUi (29) 
e 

where all variables are time dependent and are at time f, except 8s which is at time t — I, The vector 
8jJ, <E W is defined as 

Slit = G'V T Su t = G'V J Y.- l 8s t (30) 

(see eq. (fT7]i), and the channel specific learning rates ji are 

Yi = 2{C8s t -{) i 8 i i i (31) 

The derivation and generalization to aribrary activation functions are provided in the Appendix section [C] 
The update rules for T > 2 are given by a sum of such terms, with appropriate redefinitions of the vector 
SjJ,, see eq. 



2.4.2 The Hebbian nature of the update rules 

In order to interpret these rules in more neural terms, we at first note that the last term in eq. ( f28) is of an 
anti-Hebbian structure. In fact, it is given by the product of the output value a,- of neuron i times the input 
Sj into the j-th synapse of that neuron, the Yi (which are positive, as a rule) being interpreted as a neuron 
specific learning rate. Moreover, we may also consider the term 8/J.jSsj as a kind of Hebbian since it is 
again given by a product of values that are present at the ports of the synapse j of neuron i. The factor 8s j 
can be considered as a signal directly feeding into the input side of the synapse C, ; -. Moreover, 8ji given 
as Sfi = G'V J Su is obtained by using 8u as the vector of output errors in the y/ network and propagating 
this error back to the layer of the motor neurons by means of the standard backpropagation algorithm. 

These results make the generalization to more complicated, multi-layer networks straightforward. 
However already the simple setting produces an overwhelming behavioral variety, see the case studies 
below. 

More intuitively the Hebbian term acts as a self-amplification and increases the Lyapunov exponents. 
In the linear case Q this leads eventually to the divergence of the dynamics such that the PI does not exist 
any longer. With the nonlinearities, the latter effect is avoided, but the system is driven into the saturation 
region of the motor neurons. However, the second term in eq. (ESJ, by its anti-Hebbian nature, is seen 
to counteract this tendency. The net effect of both terms is to drive the motor neurons towards a working 
regime where the reaction of the motors to the changes in sensor values is maximal. This is understandable, 
given that maximum entropy in the sensor values requires a high sensorial variety that can be achieved by 
that strategy. 



3 Results 

We apply our theory to three case studies to illuminate the main features. First a hysteresis systems is 
considered to exemplify the consequences of nonstationarity and the resulting interplay between the explo- 



ration dynamics and the system dynamics in a nutshell. In section 3.2 a physical system of many degrees 
of freedom is controlled by independent controllers that spontaneously cooperate. Finally in section 3.3 we 
apply the method to a jointly controlled humanoid robot in various situations to illustrate the exploration 
process in a high-dimensional embodied system. 
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Figure 2: The hysteresis cycle in the gradient picture. The diagrams show the stages of one hysteresis 
cycle starting from h = (A) with the state at z > as represented by the sphere. Decreasing h creates 
the asymmetric situation (B). If h = —h c the saddle-node bifurcation happens, i. e. both the maximum at 
z=0 and the right minimum disappear so that the system shifts to the left minimum of the potential (C). 
Increasing h until h = brings us back to the initial situation with the state shifted to the other well see 
(D,E). The diagrams (F) and (G) depict the switching from the minimum at z < to the minimum at z > 
by increasing h. By decreasing h until h = the hysteresis cycle is finished, see (H,l). 



3.1 Hysteresis systems 

Nonstationary processes are the main target of our theory, made accessible by the special windowing and 
averaging technique presented in this paper for the first time. In order to work out the consequences, let 
us consider an idealized situation where the above derivations, in particular eqs. (fT4|)-(fl"8j), are the exact 
update rules for increasing the TiPI. 

Let us consider a single neuron in an idealized sensorimotor loop, where the sensor values are s t — 
a,_i + 4, (the white Gaussian noise | is added explicitly). This case corresponds to the dynamical system 

s f = tanh(C*,_i +/») + § (32) 

where now s t eR 1 . The system was studied earlier |2T| in the special case of h = and it was shown 
that the maximization of the PI self -regulates the system parameter C towards a slightly supercritical value 
(1 < C <C 2). There, the system is at the so called effective bifurcation point where it is bistable but still 
sensitive to the noise. 

Let us start with keeping C fixed at some supercritical value (e. g. C — 1.1) and concentrating on the 
behavior of the bistable system as a function of the threshold value h. The interesting point is that the 
system shows hysteresis. This can be demonstrated best by rewriting the dynamics in state space as a 
gradient descent. Let us introduce the postsynaptic potential it = Cs, + h and rewrite eq. < [32] > in terms of Zt 

as 

Az t = - ir U(z t ) + $ t+1 (33) 

where Az t — Zt+i — Zt and the potential is U (z) = — Clncoshz — hz + \ (using 4- lncoshz = tanhz). In that 
picture, the hysteresis properties of the system are most easily demonstrated by Fig. [2] This phenomenon 
can be related directly to the destabilization effect of the exploration dynamics. In the potential picture, 
stability is increasing with the well depth. Hence, the exploration dynamics, aiming at the destabilization 
of the system, is decreasing the depth of the well more and more until the well disappears altogether, see 
Fig. [2] and the state switches to the other well where the procedure restarts. 
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Figure 3: State and parameter dynamics in the one-dimensional system. (A) Only h dynamics (fixed 
C = 1 .2); the bias h oscillates around zero and causes the state s to jump between the positive and negative 
fixed points. The TiPI is seen to increase steadily until it eventually drops back when the state is jumping. 
(B) With full dynamics (C,/i). C increases until it oscillates around its average at C « 1.2 where the 
hysteresis cycle starts. Parameters: /io = 0.1, so = 0.8, Co = in (B), e = 0.002 



3.1.1 Deterministic self-induced hysteresis oscillation 

Now we show that in the one-dimensional case the parameter dynamics is independent of white noise. 
This implies we can in the state dynamics make the limit of vanishing noise strength and obtain a fully 
deterministic system. Again we only consider the two-step window (t = 2). Using 8s, = % t + L5s,_i = 
4, +I4t-l ( e q- ©) we nnd that the TiP 1 ' according to eq. (|50j 

/ T =lln(l+L 2 ) 

is independent of the noise. Analogously to eqs. (|2"8|i-(|3"Tji we obtain the update rules for C and h as the 
gradient ascent on P and thus the full state-parameter dynamics (with |§| —> 0) is given by 

St=g(C t -iS t - 1 +h t -i) (34) 
Q = Q_ i + 7 ( 1 / (2C,_ i ) - i,_ i a,_ i ) (35) 
ht=ht-i-ys, (36) 

vi\\hy=2L 2 /{\+L 2 ). 

Apart from the definition of y (that just modulates the speed of the parameter dynamics), the extended 
dynamical system agrees in the one-dimensional case with that derived from the principle of homeokinesis, 
discussed in detail in Der and Martius ifTTl . Let us therefore only briefly sketch the most salient features 
of the dynamics. Keeping C fixed at some supercritical value, as above the most important point is that, 
instead of converging towards a state of maximum TiPI, the h dynamics drives the neuron through its 
hysteresis cycle as shown in Fig.|2]which we call a self-induced hysteresis oscillation, see Fig. [3jA). 

For the full dynamics (with eqT([35]l) the results are given in Fig.[3|B) showing that the feedback strength 
C in the loop converges indeed toward the regime with the hysteresis oscillation. This demonstrates that 
the latter is not an artifact present only under the specific parametrization. In fact, we encounter this 
phenomenon in many applications with complex high-dimensional robotic systems, see the experiments 
with the Armband below and many examples treated in Der and Martius IfTTl . 

Interestingly this behavior is not restricted to simple hysteresis systems but is of more general relevance. 
For instance, in two-dimensional systems a second order hysteresis was observed, corresponding to a sweep 
through the frequency space of the self-induced oscillations IfTTl . It would be interesting to relate this fast 
synaptic dynamics to the spike-timing-dependent plasticity 11381 or other plasticity rules [63 1 found in the 
brain. 
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Figure 4: The probability density distributions with different time windows of the stochastic process 
in an asymmetric double well potential. The mean first passage time Tf of switching between wells is 
one characteristic time constant of the process [48], Tf increasing exponentially with the barrier height. 
If observing the process in a window of length T ^> Tf, the distribution of (A) will be observed. In that 
situation, the TiPI is maximal if the wells are of equal depth (h — 0). However, with windows of length 
T <C Tf, the system state will be predominantly in one of the wells generating the distributions shown in 
(B), (C). Gradient ascending the TiPI will decrease the well depth as long as the probability mass is still 
concentrated in that well. This is what drives the hysteresis cycle depicted in Fig. [2] 

3.1.2 About time windows 

Before giving the applications to embodied systems, let us have a few remarks on the special nature of the 
time windowing technique as compared to the common settings. Let us consider again the bistable system 
with the bias h as the only parameter and with finite noise. Figure |4] depicts a typical situation with h ^ 
so that the wells are of different depth. The figure depicts the qualitative difference between the classical 
attitude of considering information measures in very large time windows, large enough for the process to 
reach total equilibrium, as compared to our nonstationarity approach where the TiPI is estimated on the 
basis of a comparatively short window^] 

While in the former case convergence of the hysteresis parameter h towards the equilibrium condition 
h = is reached, there is no convergence in the nonstationary case. Instead, one obtains a self-induced 
hysteresis oscillation. This is generic for a large class of phenomena based on the synergy effects between 
system and exploration dynamics, see section |2~. 3. 4| which open new horizons for the explorative capabili- 
ties of the agent. In the context of homeokinesis, this phenomenon has already been investigated in many 
applications, see Der and Martius [fTTl . This paper provides a new, information theoretic basis and opens 
new horizons for applications as the matrix inversions inherent to the homeokinesis approach are avoided. 



3.2 Spontaneous cooperation with decentralized control 

Let us now give examples illustrating the specific properties of the present approach. We start with an 
example of strongly decentralized control where the TiPI driven parameter dynamics leads to the emergence 
of collective modes. Earlier papers have already demonstrated this phenomenon for a chain of passively 
coupled mobile robots |T|[2T1[65]. In me setting of Ay et al. (TJ, Der et al. ll2D . each wheel was being 
controlled by a single neuron with a synapse of strength C defining the feedback strength in each of the 
sensorimotor loops. There was no bias. As it turned out, the TiPI in the sensorimotor loop is maximal if 
the synaptic strength C is at its critical value where the system is bistable but still reacts to the external 
perturbations, i. e. the system is at its so-called effective bifurcation point 1 17 1. As compared to the present 
setting, these results correspond to using a time window of infinite length, stipulating the presence of a 
stationary state. 

The situation is entirely different when using the short time window and large update rates allowing for 
the synergy effects. In experiments with the robot chain, we observe better cooperativity with the hysteresis 
oscillations and better exploration capabilities. The reason can be seen in the fact that the self-regulated 

4 Note that the time to stay in a well is exponentially increasing with the depth of the well and decreasing exponentially with the 
strength of the noise 1 48 1 . Mean first passage times can readily exceed physical times (on the time scale of the behavior) by orders of 
magnitudes. 
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Figure 5: The ARMBAND. The robot has 12 hinge and 6 slider joints, each actuated by a servo motor 
and equipped with a proprioceptive sensor measuring the joint angle or slider length. The robot is strongly 
underactuated so that it can not take on a wheel like form where locomotion were trivial. 

bias oscillations help the chain to better get out of impasse situations. We do not give details here, since 
we will study in the following an example that demonstrates the synergy effects even more convincingly. 

3.2.1 The Armband 

The ARMBAND considered here is a complicated physical object with 18 degrees of freedom, see Fig. [5] 
The physics of the robot is simulated realistically in the LpzRobots simulator l40ll . The program source 
code for this and the next simulation is available from BTI . Each joint is controlled by an individual 
controller, a single neuron driven by TiPI maximization, as with the robot chain treated in Der et al. [21 1. 
The controller receives the measured joint angle or slider position and the output of the controller defines 
the target joint angle or target slider position to be realized by the motor. The motors are implemented as 
simulated servomotors in order to be as close to reality as possible. Moreover, the forces are limited so that, 
due to the interaction with obstacles or the entanglement of the system's different degrees of freedom, the 
true joint angle may differ substantially from the target angle. These deviations drive the interplay between 
system and exploration dynamics. 

In the experiments, we use the controller given by eq. ( |23] l and the update rules for the parameter 
dynamics as given by eqs. ( |35] l and ( |36] l. The adaptive forward model is given by eq. |24} with T = and 
the appropriate learning rules eq. |25|. In order to demonstrate the constitutive role of the synergy effect, 
we started by studying the system with fixed C and h = 0. In contrast to the chain of mobile robots, with 
fixed parameters there is no parameter regime where the ARMBAND shows substantial locomotion. This 
result suggests that, as compared to the chain of mobile robots, the specific embodiment of the ARMBAND 
is more demanding for the emergence of the collective effect. 

In order to assess the effects appropriately, note that potential locomotion depends on the forces the 
motors are able to realize. For instance, if the robot is strongly actuated, the command a = for each of the 
motors drives each joint to its center position so that the shape of the robot is nearly circular, locomotion 
readily taking place under the influence of very weak external influences. In order to avoid such trivial 
effects, we use an underactuated setting so that gravitational or environmental forces are deforming the 
robot substantially, see Fig. [5] 

The situation changes drastically if the h dynamics is included. As demonstrated by Fig. [6] substantial 
locomotion sets in only if e is large enough so that the exploration dynamics is sufficiently fast for the 
synergy effect to unfold. Also, as the experiments show, the effect is stable for a very wide range of e and 
under varying external conditions. It is also notable, that the ARMBAND robot shows a definite reaction to 
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Figure 6: Role of the fast synaptic dynamics: depending on the speed of the synaptic dynamics defined 
by e, the locomotion properties are changing drastically. Depicted is the distance traveled by the robot 
in lOmin simulated time on an empty plane. The inset gives a close up view for low £, demonstrating that 
the locomotion starts only if £ exceeds a certain threshold value. Shown is the mean and standard deviation 
of 10 runs each. 




Figure 7: Regular locomotion pattern and interaction with the environment. Plotted are the center 
positions of the 6 rigid segments in space for an interval of 40 sec. One line is highlighted for visibility. 
The trajectory starts while the robot is moving to the left (A) and is hitting the wall (B) (black box) and 
locomotes to the right (C) showing a very regular pattern. Then it overcomes an obstacle (D) and hits the 
wall (E) and moves back (F). The behavior is cyclic. Parameter: £ = 0.5. 

external influences. For instance, obstacles in its path are either surmounted or cause the robot to invert its 
velocity, see Fig. [7] The latter effect is observed in particular in the underactuated regime defined above, 
so that the reflection is not the result of the elastic collision but it is actively controlled by the involvement 
of the exploration dynamics. The role of the latter is also demonstrated by the fact that locomotion stops 
as soon as the update rate £ is put to zero, see Fig.[8]and the corresponding video SI on |4T] . 

The ARMBAND has also been investigated recently using artificial evolution for the controller [47 1, 
demonstrating convincingly the usefulness of the evolution strategy for obtaining recurrent neural networks 
that make the ARMBAND roll into a given direction. There are several differences to our approach, both 
conceptually and in the results. While in the evolution strategy the fitness function was designed for the 
specific task and many generations were necessary to get the performance, in our approach the rolling 
modes are emerging right away by themselves. Moreover, the modes are sensitive to the environment, 
for instance by inverting velocity upon collisions with a wall, they are flexible (changing to a jumping 
behavior on several occasions) and resilient under widely differing physical conditions. Interestingly, these 
behaviors are achieved with an extremely simple neural controller, the functionality of a recurrent network 
being substituted by the fast synaptic dynamics. 

3.3 High dimensional case - the Humanoid 

Let us now study the properties of the exploration dynamics in a general (not decentralized) control task. 
We consider a humanoid robot with 17 degrees of freedom. Each joint is driven by a simulated servo motor, 
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Figure 8: Armband robot surmounting an obstacle and inverting speed at a wall. Screen shots from 
the simulation for Fig. [7] The order is row- wise from left to right. The last two pictures show the situation 
after switching off the parameter dynamics e = for a few seconds (the robots stops) and enabling it again 
(starts moving). 
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Figure 9: The Humanoid robot in four different scenarios. (A) Normal environment with flat ground. 
(B) The robot is hanging at a bungee like spring. (C) The robot is attached to a high bar. (D) Robot is 
fallen into a narrow pit. 



the motor values a £ M. sent by the controller are the target angles of the joints and sensor values s £ K 17 
are the true, observed angles. This is the only knowledge the robot has about its physical state. 

The aim of this experiment is to investigate in how far the robot develops behaviors with high variability 
so that it explores its sensorimotor contingencies. Given that there is no externally defined goal for the 
behavior development, will the robot develop a high behavioral variety depending on its physics and the 
environment it is dynamically embedded into? 

That this happens indeed is demonstrated by the videos S2, S3, S4 and S5 on [41 1. However, we want 
a more objective quantity to assess the relation between body and behavior. We provide two different 
measures for that purpose. One idea is to use the parameter constellation of the controller itself for charac- 
terizing the behavior — different behaviors should reflect in characteristic parameter configurations of the 
controller. In order to study this idea, we place the robot in different scenarios, see Fig. [9] always starting 
with the same initial parameter configuration (using the result of a preparatory learning phase in the bungee 
setting), letting the robot move independently for 40 min physical time. Without any additional noise, the 
dynamics is deterministic so that variations are introduced by starting the robot in different poses, i. e. in a 
straight upright position and in slightly tilted poses (0.5° and 5° slanted to the front). We then compared 
the parameter values of the controller matrix C at each second (Is) for all simulations and calculated a 
hierarchical clustering reflecting the differences between the C matrices. Figure 10 shows the resulting 
dendrogram. 

Obviously, there is a distinct grouping of the C matrices according to the environment the robot is in 
and the behaviors developing in the respective situation. Distances between the groups are different, the 
most pronounced group corresponding to the behavior in the pit situation. This seems plausible since the 
constraints are most distinctive here, driving the robot to behaviors that are markedly different from the 
situation with the bungee setting, say, where all joints (extremities, hip, back) can move much more freely. 
There is a second pronounced group — the robot clinging to the high bar — whereas the distances between 
the C matrices controlling the robot lying on the ground and hanging at the bungee rope is less pronounced. 



3 RESULTS 



17 




Figure 10: Parameter similarity for the behavior in different environments (Fig. |9j. Plotted is the 
results of a hierarchical clustering based on the difference between the parameters in each of the simulations 
(averaged over time). For each of the four environments there are three initial poses: 0° (straight upright), 
0.5° and 5° slanted to the front. The parameters for runs in the same environment are clustered together. 
This supports the observation that the embodiment plays an essential role in the generation of behavior. 
More importantly the physical conditions are reflected in the parameters and are thus internalized. We used 
the squared norm of the difference of the absolute values of the matrix elements. The absolute values were 
used because a common structure in the parameters are rotation matrices and there the same qualitative 
behavior is obtained with inverted signs. Parameters: e = 0.001 (rj = 0.005) 



However, by visual inspection the emerging behaviors in the two latter situations appear quite different 
(compare videos S2 and S3 on ATI ) — a finding that is not so clear in the matrix distance method. 

In order to get an additional measure we start from the idea that the TiPI maximization method produces 
a series of behaviors that are qualified by a high dynamical complexity generated in a controlled way. The 
latter point means that the dimensionality of the time series of the sensor values is much less than that of 
the mechanical system - if the behavior of the robot is well controlled (think of a walking pattern) a few 
master observables will be sufficient to describe the dynamics of the mechanical system. We have tried 
different methods from dynamical system theory for finding the effective dimension of that time series 
without much success. The reason was found to be in the strongly nonstationary nature of the compound 
dynamics (system plus exploration dynamics) making low dimensional behaviors to emerge and disappear 
in a rapid sequence. So, in the long run the full space of the dynamical system is visited so that globally a 
seemingly high dimensional behavior is observed. 

In order to cope with this nonstationary characteristic, we developed a different method, splitting the 
whole time series into chunks and using an elementary principal component analysis (PCA) in order to 
define the effective dimension in each chunk: on each chunk a PCA is performed and the number of 
principal components required to capture 95% of the data's variance is plotted (mean and standard deviation 
for all chunks of the same length). In order to avoid discretization artifacts we linearly interpolate the 
required number of components to obtain a real number. 

The results presented in Fig. 1 1 corroborate the above hypothesis on the dimensionality of the behav- 
iors. In particular, we observe the increase of the effective dimension if the chunk length is increasing, 
mixing different low dimensional behaviors. The latter point is made even more obvious in Fig.[T2|depict- 
ing the overlap between the behaviors in chunks at different times. This overlap is large if the behaviors 
are essentially the same and small if the behavior has changed in the time span between the chunks. As 
the figure demonstrates, the overlap is indeed large for short time spans, but behaviors can reemerge after 
some time. Altogether, the results demonstrate that our TiPI maximization method effectively explores the 
behavior space of high-dimensional robotic systems by exciting their low-dimensional modes, avoiding in 
this way the curse of dimensionality. 
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Figure 1 1 : Dimensionality of behavior on different time scales. Humanoid robot in bungee setup run- 
ning 40 min with different control settings. The sensor data is partitioned into chunks of a fixed length, the 
graph depicting the effective dimension over the length of the chunks for different settings. In order to test 
the method we start with a uniformly distributed noise signal for motor commands ("noise signal"). As 
expected the observed dimension is maximal. The sensor values produced by that random controller show 
a lower dimension ("noise Ctrl.") as is expected due to the low pass filtering property of the mechanical sys- 
tem. All other cases are with the TiPI maximization controller with different update rates e. In particular, 
the comparison with the e = case demonstrates that the exploration dynamics produces more complex 
behaviors than any fixed controller. 





Figure 12: Behavioral changes with time. Pairwise distances of chunks with length 10 s. Distance is 
defined as the length of the vector of maximal projections of the first 6 principal components. 
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4 Discussion 

Can a robot develop its skills completely on its own, driven by the sole objective to gain more and more 
information about its body and its interaction with the world? This question raises immediately further 
issues such as (i) what is the relevant information for the robot and (ii) how can one find a convenient 
update rule that realizes the gradient ascent on this information measure. We have studied the predictive 
information of the stream of sensor values as a tentative answer to the first question and, based on that, 
could give exact answers to the second question for simple cases. Earlier work was restricted to linear 
systems ||2) . In order to be applicable to actual robotic systems we extend it to the case of nonlinear 
controllers and to nonstationary processes leading to a new measure called TiPI (time-local predictive 
information). Using several approximations we have been still able to obtained analytical results. In this 
way we derived an explicit exploration dynamics for the controller parameters based on an information 
maximization principle, namely by maximizing the TiPI using gradient ascent. For neural networks the 
gradient yields a fast synaptic dynamics which is essentially local in nature. Interestingly the TiPI landscape 
(on which the gradient is calculated) continuously changes its shape due to the general destabilization of 
the system dynamics inherent in maximizing the TiPI. For instance if the system dynamics is in an attractor, 
increasing the TiPI destabilizes the attractor until it may disappear altogether with a complete restructuring 
of the TiPI landscape. This is another reason why nonstationary processes have to be handled and why no 
convergence of the parameter dynamics is desired. 

We studied a one-dimensional hysteresis system in order to work out the consequences of the nonsta- 
tionary. The parameter dynamics leads to a slightly supercritical regime and additionally a self-induced 
hysteresis oscillation emerges. This is a useful new property as shown in the experiment with the ARM- 
BAND robot, a high-dimensional robot with a complicated dynamics. Despite the highly decentralized 
control — each joint is controlled individually — the robot develops coherent and global pattern of behav- 
ior. This is enabled by the continuous adaptation and spontaneous mutual cooperation of the individual 
controllers (hysteresis elements). We find the effect to be very robust against the speed of the exploration 
dynamics. Interestingly in the one-dimensional case the update formulas are independent of white noise 
and we can obtain an exploration dynamics in a fully deterministic system. 

The new theoretical basis also allows for controlling complex high-dimensional robotic systems. This 
is demonstrated by a series of experiments with the HUMANOID robot, now jointly controlled by a single 
high-dimensional controller. Given that there is no externally defined goal for the behavior development, 
will the robot develop a high behavioral variety depending on its physics and the environment it is dynami- 
cally embedded into? Our results support a positive answer to this question. We quantify the dimensionality 
and temporal structure of the behavior and find a succession of low-dimensional modes that increasingly 
explore the behavior space. Furthermore we show that environmental factors influence the internal as well 
as behavioral development. Without additional noise, the deterministic dynamics leads to an individual 
development which depends decisively on the particular experiences made during the lifetime. 

The exploration dynamics can be viewed as a self-directed search process, where the directions to 
explore are created from the dynamics of the system itself. Without a random component the changes 
of the parameters are deterministically given as a function of the sensor values and internal parameters 
in a certain time window. For an embodied system this means in particular that constraints, responses 
and current knowledge of the dynamical interaction with the environment can directly be used to advance 
further exploration. Randomness is replaced with spontaneity which we demonstrate to restrict the search 
space automatically to the physically relevant dimensions. Its effectiveness is shown in the HUMANOID 
experiments and we argue that this is a promising way to avoid the curse of dimensionality. 

What is the relation of the parameter dynamics described here to other work on maximizing information 
quantities in neural systems? Maximizing the mutual information between input and output of a neuron, 
known as InfoMax, yields a very similar parameter dynamics |5|. Interestingly, when applied to a feed- 
forward network an independent component analysis can be performed. Also similar rules have been 
obtained in Triesch [61 ] where the entropy of the output of a neuron was maximized under the condition of 
a fixed average output firing-rate Triesch ||6T| . The resulting dynamics is called intrinsic plasticity as it acts 
on the membrane instead of on the synaptic level and it was shown to result in the emergence of complex 
dynamical phenomena lfTTll3Tl l62 32). In Markovic and Gros ll36l[37l a related dynamics is obtained at 
the synaptic level of a feedback circuit realized by an autaptic (self) connection. In a recurrent network 
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of such neurons it was shown that any finite update rate (e in our case) destroys all attractors, leading to 
intermittently bursting behavior and self-organized chaos. 

Our work differs in two aspects. On the one hand, we use the information theoretical principle at 
the behavioral level of the whole system by maximizing the TiPI on the full sensorimotor loop, whereas 
they use it at the neuronal level. Nevertheless we manage to root the information paradigm back to the 
level of the synaptic dynamics of the involved neurons. On the other hand, as a direct consequence of 
that approach, there is no need to specify the average output activity of the neurons. Instead the latter is 
self-regulating by the closed loop setting. Independent of the specific realization, the general message is 
that these self-regulating neurons realize a specific working regime where they are both active and sensitive 
to influences of their environment. If embedded into a feedback setting many interesting phenomena are 
produced. Instead of studying them in internal (inside the "brain") recurrences, we embed such neurons 
into a feedback loop with complex physical systems where the self-active and highly responsive nature of 
these neurons produces similar phenomena at the behavioral level. 

In the current form, our approach is limited to the control of robots where the sensorimotor dynamics 
can be, in its essence, modeled by a simple feed-forward neural network. The parameter dynamics can also 
be calculated for more complex controllers, such as recurrent networks, which remains for future work. 
In this study only proprioceptive sensors measuring joint angles have been used. However, our newest 
experiences have shown that also other sensors e. g. current sensors, acceleration sensor or velocity sensors 
can be successfully integrated. 

To conclude, information theory is a powerful tool to express principles to drive autonomous systems 
because it is domain invariant and allows for an intuitive interpretation. We present for the first time, to 
our knowledge, a method linking information theoretic quantities on the behavioral level (sensor values) 
to explicit dynamical rules on the internal level (synaptic weights) in a systematic way. This opens new 
horizons for the applicability of information theory to the sensorimotor loop and autonomous systems. 
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A Estimating the TiPI 

As stated above we consider the TiPI on the process of error propagations because it allows us to derive 
explicit expressions. Thus we start with the definition of the error propagation to derive eq. ( fTT| and provide 
further insights. 

As a first step, using the notion of an orbit of the dynamical system defined by the map y/ : M." — > M." 
we define a sequence of states $ t i £ W 

s t , = ( S ,_ T ) (37) 

for any time t' within the time window t — X <t' <t starting from state i r _ T = s,_ T . y k {s) denotes the 
A;-fold iteration of the map \j/ with \jf^ (s) = s. We can consider s t i as the predicted state over t 1 — {t— t) 
time steps. In particular, the prediction over T steps is s t = \j/ T (st-r). 
The error propagation can now be defined as the difference 

8s t i = s,i — s t i (38) 

between the true state s t i, eq. (|§J, and the state $,/ obtained by the deterministic dynamics (y/), see Fig.[T] 
The dynamics of the 8s t i obeys the rul^j 

5* t /=L(^_ 1 ) +^ + 0(||&|| 2 ) (39) 

with starting state 8s r - T = and L(s) denoting the Jacobian matrix of y/. In the following we will use this 
approximation which is arbitrary good for infinitesimally small noise. Note that this dynamics corresponds 
to that of a linear system^] however with state dependent dynamical operator L. 

As a remark, in the case of finite noise, we can obtain a related exact rule by using the mean value 
theorem of differential calculus stating that under mild restrictions one can find a state s t > £ [s t i , s t r] so that 

8s tl =L{s t ,_ l )8s t ,_ l +^ (40) 

yields the exact dynamics of the multi-step prediction error 8s t . 

The interesting point now is that P (S t : St-i) (eq. Q) is equal to that of the process defined by the 
error propagation dynamic^ji. e. 

I x (S t :S t - 1 )=F(8S t :8S^ l ) (41) 



This result is central for the following arguments — we will make use of the fact that the dynamics eq. ([39 
is more easily treated to obtain explicit estimates for the TiPI and its gradient. 



5 Proof: Using ,y = we write 

Sty = s t , - S t , = y ) + 4- - Y (.y _ i ) 



= Y (ty_i + 5ty_i ) - y (!,-_! ) + 
= L{s,,_ l )8s,,_ i + ^ + om\ 1 ) 

6 In a linear system, L is independent of the state. In this case S,i = Ls,/_ 1 such that the dynamical evolution of Ss and s are the 
same. 

'Consider two random vectors S and S' together with the shifted vectors U = S + a and U' = S' +d '. Using that the probability 
distribution functions (pdf) p$ (s) and pu (u) obey pu(u) = pu {s + a) = p$ (s) one obtains H (S) = H(U). Analogously, the joint 
pdf's obey p vv , (u,u') = p uv {s + aj + a') = p ss , (s,s') so that H(S'\S) =H(U'\U). 
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Explicit expressions 



By iterating eq. ( [39) l we obtain an explicit expression for 8s t (using here and in the following L{t') for 
L(s,,)) 

5*=£Z.W(f-l)&_ 4 (42) 

k=0 

with 

L w (f-l)=L(r-l)---L(r-Ar),andL (0) =1 (43) 

for any /. In general it is very complicated to obtain the entropy of 8S t in realistic situations with high 
dimensional physical systems. Therefore we will base the further considerations on a convenient estimate 
of the latter. With white Gaussian noise, the process 8S t is Gaussian as well, i. e. 8S t ~ JV (0,E f ) (it is 
a linear combination of independent Gaussians), so that the entropy is given in terms of the covariance 
matrix L t of the random vector 8S t as |[T2l 

H T (8S t ) = 1 -ln\Z t \ + ^ln27te (44) 

[A | denoting the determinant of a square matrix A and 

E, = (8S t 8Sj) = j p (8s t ) 8s t 8sJ d8s t (45) 

is the covariance matrix of 8S t and p(8s t ) is the probability density distribution of the random variable 
8S t . Using eq. ( p2| i, explicit expressions for E can readily be obtained, see eq. ( |49| ) below. 
By the same arguments, the conditional entropy is defined, using eq. (|7), as 

H T (8S t 1 5S,_ 1) —H T (Z t ) — ^ln|D,| + ^\n2%e (46) 

with 

D t = (z t Zj)= [p($)te T d$ (47) 



where S denotes the process of the noise with p (£, ) being the probability density function of S ~ j¥ (Q,D t ). 
Thus we obtain the estimate of the TiPI as 

l z (SS, : SS t -i) = X - In |E, | - l - In \D, \ (48) 

which is the entropy of the state 8s minus that of the noise. 



White noise 



Explicit expressions revealing more details of the theory are obtained for the case of white noise, meaning 
<4,4 f 7> =0if r ^f', so that using eq. ( |4"2"| ) in eq. ( |4"5j ) yields 



T-1 

E = 

ir=0 



£L«D(L«) T (49) 



In particular, in the case of T = 2, the shortest nontrivial time window, we find 
L = D + LDL T . 
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It is also useful to introduce the transformed dynamical operatoj^ L = VD l Ly/T) which leads to E = 



I T (8S t : 8S,-i 



D and (using \VDM\/D\ = \MD\ = \M\ \D\) 
1 



■In 



k=0 ' ' 



(50) 



Interestingly, the L operators also exist if the overall noise strength A = ||<j;|| goes to zero, so that P stays 
finit^] although the defining entropies, conditioned on the state s r _ T , are equal to zero in the deterministic 
system. 



The linear case 

For linear systems explicit expressions for the PI were obtained in Ay et al. 0. In this case L is not 
dependent on the state s t of the system so that = L k in eq. ( |43j ). Using eq. ( |49| ), with T — > we 
reobtain the result^] of Ay et al. [2j. Under the additional assumption that L is a normal matrix and the 
noise is isotropic the explicit expression E = (I — LL T ) was obtained. 



B Explicit gradient step 

In order to derive the general gradient step on the TiPI based on eq. ( fT3| we need to calculate the derivative 
In |E f |. Considering any (square) matrix M depending on a single parameter 9 k of the set 9 we hav^j 
(see for example Magnus and Neudecker ll35l ) 

^-ln|M| = ~ 
dM 1 1 M T 



and 



M • 



ln|M|=£M/^=Tr 
';' 

so that, using E = E T = (5i5i T ) and omitting the time index 



d e k l ^ = T <{zwS 5s5s )) (51) 

By using the cyclic invariance of the trace we obtain from eq. ( |5T| i 

9 ■\n\Z t \ = (8sJZ- 1 4 a 8s t ) (52) 



d9 1 " \ 1 d0 

now valid for the entire set of parameters 6. By eq. ( |40] > we obtain (ignoring the dependence of £, on the 
parameter) 

d _ aL(/'-l)_ r ,, 5 _ 

M & " = -w & "- 1+L ^- 1 )5o & "- 1 

so that by iteration 

8 This corresponds to using a so-called whitening transformation on the state dynamics, replacing in eq. (40) the state vector Ss by 
a new vector 5a" = yD~^Ss so that the covariance matrix of the noise in the Sx dynamics is just the unit matrix. 

'introducing D = A~ 2 D where D stays finite with A — > 0, we have L = s/d^Ls/d = s/DLs/D since A cancels out. 

10 Note that all eigenvalues of the Jacobi matrix L must be less than one by absolute value so that the limes will exist. This 
requirement also guarantees that the conditioning on s,_ T looses its influence for X — > °°. 

"We write A for M _1 here and in the following. 
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where iffi (t — 1) is given in eq. ( |43j ). Using a T Wb = (W T aj J b, we write 

^InlE^g/s^m^^^A (53) 



where (E is symmetric) 

= (i^- 1 ) (f- l)) T S r - 1 ^ (54) 

Stipulating the self-averaging property of the stochastic gradient, see section One-shot gradients for details, 
we realize the update rule as 

t, 1 e T dL(t-l) „ 

A9 = e £ 5 Mf T -/+i L - gj>-l (55) 
Here we see again that T = 2 is the simplest non-trivial case where the sum consists of a single term. 
Characterizing the parameter dynamics 

In order to better characterize the parameter dynamics, let us consider for the moment E at the r. h. s. of 
eq. ( |5T| i to be some fixed, positive matrix (not depending on the parameters Ok). Then, we can write 

*(£^ T >)-£H£«* T ))-£(«4* 

(using the cyclic invariance of the trace in the last step). The update rule eq. ( fT3j ) becomes using again the 
self-averaging 

A9 = e~\\8 S \\l (56) 

where = a 1 M~ x a defines the length of a vector a in the metric given by M (considered fixed in the 
current gradient step). From eq. |56| it becomes obvious that following the gradient is to increase the norm 
of 8s in the E metric. 

C Neural networks — derivation of the update rule 



We derive the parameter dynamics for neural networks eq. ( |28| ) from the general parameter dynamics for the 
two-step time window given by eq. ( fT4| ). According to eq. ( 26} we have L — VG' (z)C + T with z = Cs + h 
and G'(z) = diag[g| (z), . . . ,g' m (z)]- Putting this into eq. ([14 1 yields (omitting the time indices) 

-AC,,- = 8u T ^— 8s 
e dCij 

= 8u T VG'^8s + 8u T V^C8s 
dQj dCij 

= (g'V t 8u) 8sj + 8u T V^C8s (57) 



The second term remains to be calculated. Because G' is a diagonal matrix the vectors on both sides of the 
derivative carry the index i such that we get 

8u T V^C8s = (8u Y v) j g'!(z)s j (C8s) j (58) 
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In the case of g (z) = tanh (z) we find, using g" (z) = —2g' t (z)gi (z) and a — g(z) 
dG' 



8u T V^—C8s = -2(8u T VG') (C8s) i a i Sj = -Y i a i Sj (59) 



■-'J 

with 



Y i = 2{C8s) l 8n i (60) 
8/it = (g'V j Su) (61) 



The final update rule follows by putting eq. ( |59l > and eq. ( |6T| into eq. ( |57| i 

AC,-/ = edfiidsj - EjiCLiSj (62) 
Analogously we obtain the parameter dynamics of h as 

-Mi = 8u T ^8s = 8u T V^-C8s 
e dhi dhj 

= (Su T v) i g' i '(z)(CSs) l = -r l a, (63) 
A more compact matrix notation can be obtained by introducing the diagonal matrix F 
r = diag[yi, — ,Yi\ 
and thus (reintroducing the time indices) 

^AC t = 8ii,8sJ_ , - I>,s7 (64) 

-Ah t — — r t a, (65) 
e 

In the case of arbitrary neuron activation functions g we obtain equivalent formula by defining 

Y, = -^-{CSs^Stn (66) 

„" 

Note the factor — 4^ is 2 in the case of g = tanh. 

SiSi 



In the derivation of eqs. |59| and ( |63] l we ignored the dependence of the state s in g' (Cs + h) on the 
parameters C and h. This dependence can be considered explicitly if the state is at a fixed point. In that 
case, a more detailed discussion in Der and Martius ifTTl (section 6.2) shows that the effect of the derivative 
can be condensed into the so-called sense parameter a multiplying y. Thus we replace y as 

y i <- aji (67) 

where a is an empirical constant, typically a > 1, by which the sensitivity of the sensorimotor dynamics to 
external perturbations can be regulated. This works also in more general cases like a limit cycle dynamics, 
see Der and Martius IfTTl . 

D Learning the inverse covariance matrix 

Note that the covariance matrix given in eq. ([15) can be easily obtained by the on-line update rule 

AE, = 77 (Ss,8sJ — E f l (68) 
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or 

E m = (1 - 77) E, + 775^ T (69) 

realizing a sampling over a restricted period of time. The update rate 77 defines the time horizon t h °^ 77 _1 
for the averaging. The only remaining nontrivial operation in that setting is the inversion of the covariance 
matrix E. However, this can also be reduced to elementary operations by using the Sherman-Morrison 
formula as given by 



(a + uv t ) 1 =A~ X 1 ^-A- 1 uv T A- 1 

\ ) 1+v'A _1 m 



. + 1 

Putting A = (1 — 77) E and uv = r\8s8s we get 

((1 -77)^ + 775^7)^= — \ — z; l 8s t 8sj^ 

and thus 

jr. 1 , = - J*— jr 1 Ss t SsJ Lr 1 

(+i 1-77' 1-77 ' ' ' 

where j3 £ R is given by 

fl = ^ 

(1-77 + 775^-15^) 

Note that 5*7 ^i 1 Ss t featuring in the denominator of j3 is a scalar so that with Ej -1 given there is no matrix 
inversion to be done. 

If E, is an n x n matrix, the cost of getting E f+ i is O (n 2 ) . This is very favorable if the dimension of the 
sensor space is large. Using the above formula, the only true inversion (of order O (« 3 )) has to be done just 
once, when starting the process (with a convenient initialization of E). 



